A Method for Finding Similar Documents Relying on Adding Repetition of Symbols in Length Based Filtering

Authors

  • Hossein Azgomi
  • Masumeh Ghasemi Mahsayeh
  • Masoud Mohammadi
  • Milad Moradi
Abstract

Finding similar items is a basic topic in mining massive datasets; finding similar documents is one example. Many methods exist for this task, among them shingling and length-based filtering. In shingling, substrings of each document, called symbols, are selected and placed in a set; similar documents are then found by calculating the similarity of the corresponding sets. In length-based filtering, only documents whose lengths are close to each other are compared. These methods do not consider the repetition of symbols. Taking repetition into account allows the length of a document to be calculated more accurately. In this paper we propose a method for finding similar documents that considers the repetition of symbols. This method separates documents more effectively. The main goal of this paper is to present a method for finding similar documents that requires fewer comparisons and less time.
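The abstract does not give the details of the proposed method, so the following is only a minimal sketch of the general idea, under stated assumptions: symbols are taken to be k-character shingles, the document length is the number of shingles counted with repetition, and similarity is the multiset (bag) Jaccard measure. The function names `shingles`, `bag_jaccard`, and `length_filtered_pairs` are hypothetical and are not taken from the paper. The filter relies on the fact that bag Jaccard similarity can never exceed the ratio of the shorter length to the longer one, so pairs whose length ratio falls below the threshold can be skipped without comparison.

```python
from collections import Counter


def shingles(text, k=3):
    """Return every k-character shingle of text, keeping repeats (assumed symbol definition)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def bag_jaccard(a, b):
    """Multiset Jaccard: sum of minimum counts divided by sum of maximum counts."""
    ca, cb = Counter(a), Counter(b)
    keys = ca.keys() | cb.keys()
    inter = sum(min(ca[x], cb[x]) for x in keys)
    union = sum(max(ca[x], cb[x]) for x in keys)
    return inter / union if union else 1.0


def length_filtered_pairs(docs, threshold, k=3):
    """Compare only pairs whose repetition-aware lengths are close enough.

    docs maps a document name to its text. Returns the similar pairs and
    the number of full similarity computations that were performed.
    """
    bags = {name: shingles(text, k) for name, text in docs.items()}
    # Sort by length with repetition so the inner loop can stop early.
    order = sorted(bags, key=lambda name: len(bags[name]))
    similar, comparisons = [], 0
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            la, lb = len(bags[a]), len(bags[b])
            # bag_jaccard <= la / lb, so every later (longer) b fails as well.
            if lb > 0 and la / lb < threshold:
                break
            comparisons += 1
            score = bag_jaccard(bags[a], bags[b])
            if score >= threshold:
                similar.append((a, b, score))
    return similar, comparisons


docs = {"a": "abcabcabc", "b": "abcabcabd", "c": "abc" * 20}
pairs, n_compared = length_filtered_pairs(docs, threshold=0.5)
print(pairs, n_compared)
```

In this toy input, documents `a` and `c` contain the same distinct shingles, so a length based on distinct symbols alone would not separate them; counting repetitions gives them very different lengths and the pair is skipped before any similarity is computed.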

Similar resources

An Adaptive Hierarchical Method Based on Wavelet and Adaptive Filtering for MRI Denoising

MRI is one of the most powerful techniques to study the internal structure of the body. MRI image quality is affected by various noises. Noises in MRI are usually thermal and mainly due to the motion of charged particles in the coil. Noise in MRI images also cause a limitation in the study of visual images as well as computer analysis of the images. In this paper, first, it is proved that proba...

Spatial and symbolic recognition of Chinese mosques

The history of Islam in China began when the first ambassador of Islamic caliphate in 654 AD, gained the court of the Chinese emperor. After that Islam has been spread throughout there during a century. In this study, authors try to study about how architectural elements and spatial forms are effected from Islam or Buddhist-Chinese tradition. Then, at the first it must be clear that which symbo...

Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting is to make searchable unindexed image documents by locating word/words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...

QoS-based Web Service Recommendation using Popular-dependent Collaborative Filtering

Since, most of the organizations present their services electronically, the number of functionally-equivalent web services is increasing as well as the number of users that employ those web services. Consequently, plenty of information is generated by the users and the web services that lead to the users be in trouble in finding their appropriate web services. Therefore, it is required to provi...

A Novel Trust Computation Method Based on User Ratings to Improve the Recommendation

Today, the trust has turned into one of the most beneficial solutions to improve recommender systems, especially in the collaborative filtering method. However, trust statements suffer from a number of shortcomings, including the trust statements sparsity, users' inability to express explicit trust for other users in most of the existing applications, etc. Thus to overcome these problems, this ...

Journal:
  • CoRR

Volume: abs/1712.03190

Publication date: 2014